Exploratory Data Analysis (EDA)

EDA analyzes a data set to summarize the main characteristics of its variables, often with visual graphs, without fitting a statistical model.

1. Overview of the data

This step reports the dimensions of the data set, the variable names, an overall summary of missing values, and the data type of each variable.

# Overview of the data
ExpData(data=data,type=1)
# Structure of the data
ExpData(data=data,type=2)
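As a rough illustration of what these two summaries contain, the sketch below builds similar tables with base R only. It is not SmartEDA's implementation: the source does not say which data set `data` is, so `mtcars` is used as a stand-in (it happens to contain an `mpg` column), and the names `overview` and `structure_tbl` are hypothetical.

```r
# Assumption: mtcars stands in for the report's unnamed `data` object.
data <- mtcars

# Overall picture: dimensions, number of numeric columns, completeness.
overview <- list(
  n_rows             = nrow(data),
  n_cols             = ncol(data),
  n_numeric          = sum(vapply(data, is.numeric, logical(1))),
  pct_complete_cases = 100 * mean(complete.cases(data))
)

# Per-variable structure: type, missing count, and number of distinct values.
structure_tbl <- data.frame(
  variable  = names(data),
  type      = vapply(data, function(x) class(x)[1], character(1)),
  n_missing = vapply(data, function(x) sum(is.na(x)), integer(1)),
  n_unique  = vapply(data, function(x) length(unique(x)), integer(1)),
  row.names = NULL
)

print(overview)
print(structure_tbl)
```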

Target variable

Summary of continuous dependent variable

  1. Variable name - mpg
  2. Variable description - ****

2. Summary of numerical variables

Summary statistics for the numerical variables when the dependent variable, mpg, is continuous.

ExpNumStat(data,by="A",gp=Target,Qnt=seq(0,1,0.1),MesofShape=2,Outlier=TRUE,round=2)
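To make the arguments above more concrete, here is a minimal base-R sketch of the kind of per-variable summary they request: deciles (as in `Qnt=seq(0,1,0.1)`), shape measures, and an outlier count. It is only an approximation under stated assumptions: `mtcars` stands in for `data`, `num_summary` is a hypothetical helper, skewness and excess kurtosis use a simple moment-based estimate, and outliers are counted with the usual 1.5 × IQR rule, all of which may differ from what SmartEDA computes.

```r
# Assumption: mtcars stands in for the report's unnamed `data` object.
data <- mtcars

# Hypothetical helper: summary statistics for one numeric vector.
num_summary <- function(x, probs = seq(0, 1, 0.1), digits = 2) {
  x <- x[!is.na(x)]
  z <- (x - mean(x)) / sd(x)                      # standardized values
  q <- quantile(x, c(0.25, 0.75), names = FALSE)  # quartiles for the IQR rule
  fence <- 1.5 * (q[2] - q[1])
  stats <- c(
    n          = length(x),
    mean       = mean(x),
    sd         = sd(x),
    skewness   = mean(z^3),                       # moment-based estimate
    kurtosis   = mean(z^4) - 3,                   # excess kurtosis
    n_outliers = sum(x < q[1] - fence | x > q[2] + fence)
  )
  round(c(stats, quantile(x, probs)), digits)
}

# One row per variable, one column per statistic.
num_tbl <- t(sapply(data, num_summary))
print(num_tbl)
```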

3. Distributions of numerical variables

Graphical representation of all numeric features. The following types of plots are used to explore the data:

  • Quantile-quantile plot (Univariate)
  • Density plot (Univariate)
  • Scatter plot (Bivariate)

Quantile-quantile plot for Numerical variables - Univariate

Quantile-quantile plot for all Numerical variables

ExpOutQQ(data,nlim=4,fname=NULL,Page=c(2,2),sample=sn)
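For readers without the package at hand, a comparable 2 x 2 page of normal quantile-quantile plots can be sketched with base graphics. This is an illustration, not SmartEDA's code: `mtcars` stands in for `data`, and the filter that keeps only variables with more than four distinct values is an assumed reading of `nlim=4`.

```r
# Assumption: mtcars stands in for the report's unnamed `data` object.
data <- mtcars

# Keep numeric variables with more than 4 distinct values
# (an assumed interpretation of nlim = 4).
num_vars <- names(data)[vapply(
  data,
  function(x) is.numeric(x) && length(unique(x)) > 4,
  logical(1)
)]

# Draw the first four variables on a 2 x 2 page, like Page = c(2, 2).
op <- par(mfrow = c(2, 2))
for (v in num_vars[1:4]) {
  qqnorm(data[[v]], main = v)  # sample quantiles vs. normal quantiles
  qqline(data[[v]])            # reference line through the quartiles
}
par(op)
```

Points that fall close to the reference line suggest an approximately normal distribution; systematic curvature points to skewness or heavy tails.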

Density plots for numerical variables - Univariate

Density plot for all numerical variables

ExpNumViz(data,target=NULL,nlim=10,fname=NULL,col=NULL,theme=theme,Page=c(2,2),sample=sn)

Scatter plot for all Numeric variables

Scatter plots between the numeric variables and the target variable mpg. These plots help to examine how strongly the target variable is correlated with the independent variables in the data set.

ExpNumViz(data,target=NULL,nlim=5,Page=c(2,1),theme=theme,sample=sn,scatter=TRUE)

Correlation between the dependent variable and the independent variables

Dependent variable is mpg (continuous).

ExpNumViz(data,target=Target,nlim=5,fname=NULL,col=NULL,theme=theme,Page=c(2,2),sample=sn)

Correlation summary table

ExpNumStat(data,by="GA",gp=Target,MesofShape=2,Outlier=FALSE,round=2,dcast=T,val="cor")
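A correlation summary of this kind can be approximated in base R as below. This is a sketch under assumptions, not the package's method: `mtcars` stands in for `data`, `"mpg"` is taken as the value of `Target`, Pearson correlation is assumed, and `cor_tbl` is a hypothetical name.

```r
# Assumption: mtcars stands in for `data`, and Target is "mpg".
data <- mtcars
target <- "mpg"

# All numeric columns except the target are treated as independent variables.
predictors <- setdiff(names(data)[vapply(data, is.numeric, logical(1))], target)

# Pearson correlation of each independent variable with the target.
cor_tbl <- data.frame(
  variable    = predictors,
  correlation = round(vapply(
    predictors,
    function(v) cor(data[[v]], data[[target]], use = "complete.obs"),
    numeric(1)
  ), 2),
  row.names = NULL
)

# Strongest relationships first, regardless of sign.
cor_tbl <- cor_tbl[order(-abs(cor_tbl$correlation)), ]
print(cor_tbl)
```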

4. Summary of categorical variables

Summary of categorical variables

  • Frequency of all categorical independent variables
ExpCTable(data,margin=1,clim=10,nlim=5,round=2,per=T)
  • Frequency of all categorical independent variables by discretized mpg
## bin=4: mpg is discretized into 4 categories based on quantiles
ExpCTable(data,Target=Target,margin=1,clim=10,nlim=5,round=2,bin=4,per=T)
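The quantile-based discretization mentioned above can be illustrated with base R. The sketch assumes `mtcars` as a stand-in for `data`, uses `cyl` as an example categorical variable, and interprets `margin=1` as row-wise percentages; none of this is confirmed by the source, and the names `mpg_bin`, `freq`, and `pct` are hypothetical.

```r
# Assumption: mtcars stands in for the report's unnamed `data` object.
data <- mtcars

# Cut mpg into 4 bins at its quartiles (bin = 4 in the call above).
breaks  <- quantile(data$mpg, probs = seq(0, 1, length.out = 5))
mpg_bin <- cut(data$mpg, breaks = breaks, include.lowest = TRUE)

# Cross-tabulate one categorical variable against the binned target.
freq <- table(cyl = data$cyl, mpg_bin = mpg_bin)

# Row-wise percentages: each cyl level sums to roughly 100.
pct <- round(100 * prop.table(freq, margin = 1), 2)

print(freq)
print(pct)
```

`include.lowest = TRUE` matters here: without it the minimum mpg value falls outside the first interval and becomes `NA`.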

5. Distributions of Categorical variables

Graphical representation of all Categorical variables

  • Bar plot (Univariate)

Bar plot with vertical or horizontal bars for all categorical variables

ExpCatViz(data,clim=10,margin=2,theme=theme,Page = c(2,2),sample=sc)
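A base-graphics sketch of comparable bar plots follows. It assumes `mtcars` as a stand-in for `data` and treats any variable with at most 10 distinct values as categorical, which is only an assumed reading of `clim=10`; the package's own selection rule and styling may differ.

```r
# Assumption: mtcars stands in for the report's unnamed `data` object.
data <- mtcars

# Treat variables with at most 10 distinct values as categorical
# (an assumed interpretation of clim = 10).
cat_vars <- names(data)[vapply(
  data,
  function(x) length(unique(x)) <= 10,
  logical(1)
)]

# Percentage bar plots on a 2 x 2 page, like Page = c(2, 2).
op <- par(mfrow = c(2, 2))
for (v in cat_vars[1:4]) {
  barplot(100 * prop.table(table(data[[v]])), main = v, ylab = "Percent")
}
par(op)
```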